Special Linguistic Phenomena in the Bulgarian HPSG-based Treebank (BulTreeBank)

Author

Petya Osenova

Abstract

Currently the BulTreeBank comprises 214,000 tokens in a little more than 15,000 sentences. Each token is annotated with morphosyntactic information. Additionally, Named Entities are annotated with ontological classes such as person, organization, location, and other. Based on HPSG theory, the annotation scheme defines a number of phrase types which reflect both the constituent structure and the head-dependent relation. Thus we have phrase labels that make the dependent type explicit, such as VPC (verbal head complement phrase), VPS (verbal head subject phrase), VPA (verbal head adjunct phrase), NPA (nominal head adjunct phrase), etc. Beyond the constituent structures and the head-dependent relations, the treebank also represents phenomena such as coordination, ellipsis, pro-dropness, word order, secondary predication, and control; see (Simov and Osenova 2003). We will focus on some of them in this demo presentation. The treebank is encoded in XML.
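As a rough illustration of how phrase labels of this kind could be read from an XML encoding, the following Python sketch maps the four labels named in the abstract to their head category and dependent type, then counts them in a small tree. The XML layout, the element names `s`, `N`, `V`, `A`, and the placeholder tokens `w1` to `w4` are assumptions made for this example; they are not taken from the actual BulTreeBank format.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Label meanings as stated in the abstract: (head category, dependent type).
PHRASE_LABELS = {
    "VPC": ("verbal", "complement"),
    "VPS": ("verbal", "subject"),
    "VPA": ("verbal", "adjunct"),
    "NPA": ("nominal", "adjunct"),
}

# Hypothetical encoding: one element per phrase, named after its label.
# The real treebank's XML schema may differ substantially.
SAMPLE = """
<s>
  <VPS>
    <N>w1</N>
    <VPC>
      <V>w2</V>
      <NPA>
        <A>w3</A>
        <N>w4</N>
      </NPA>
    </VPC>
  </VPS>
</s>
"""


def count_dependent_types(xml_text):
    """Count (head category, dependent type) pairs over all phrase elements."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for elem in root.iter():
        # Elements that are not known phrase labels (words, the root) are skipped.
        if elem.tag in PHRASE_LABELS:
            counts[PHRASE_LABELS[elem.tag]] += 1
    return counts


counts = count_dependent_types(SAMPLE)
for (head, dependent), n in sorted(counts.items()):
    print(f"{head} head + {dependent}: {n}")
```

Keeping the label table separate from the tree walk means further phrase types can be added without touching the traversal.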


Similar resources

HPSG-based syntactic treebank of Bulgarian (BulTreeBank)


Practical Annotation Scheme for an HPSG Treebank of Bulgarian

The paper presents an HPSG-based annotation scheme for constructing a Bulgarian treebank: BulTreeBank. It differs from other grammar-based annotation schemes in having a hybrid status with respect to the partial parsing component and the full parsing module. As the parsing complexity is handled preferably by the pre-processing step, the task of the HPSG module is maximally facilitated and simpl...


A Data-Driven Dependency Parser for Bulgarian

One of the main motivations for building treebanks is that they facilitate the development of syntactic parsers, by providing realistic data for evaluation as well as inductive learning. In this paper we present what we believe to be the first robust data-driven parser for Bulgarian, trained and evaluated on data from BulTreeBank (Simov et al., 2002). The parser uses dependency-based representa...


Constituency Parsing of Bulgarian: Word- vs Class-based Parsing

In this paper, we report the results of two constituency parsers trained with BulTreeBank, an HPSG-based treebank for Bulgarian. To reduce the data sparsity problem, we propose using Brown word clustering to do an off-line clustering and map the words in the treebank to create a class-based treebank. The observations show that when the classes outnumber the POS tags, the results are...


Language Resources and Tools for the Creation of a Bulgarian Treebank

This paper describes a framework for the creation of an HPSG-based treebank of Bulgarian. The architecture consists of several types of language resources and tools, such as gazetteers, a morphological dictionary, a valence dictionary, a semantic dictionary, named entities recognition grammars, chunk grammars for NPs and VPs, a general HPSG grammar. The paper describes each of them, including t...


Publication date: 2006
